Image upload benchmark by CrabExtra · Pull Request #238 · Devsh-Graphics-Programming/Nabla-Examples-and-Tests

CrabExtra · 2025-12-24T15:15:16Z

Added simple Image uploading benchmark

Erfan-Ahmadi · 2026-04-24T07:56:09Z

+	}
+
+
+	double runBenchmarkImageStaging(


this was a failed attempt because we realized we cannot work with preinitialized images and optimal layout in host visible memory, correct?

let's remove the body of the function and replace it with a comment: that saves some LoC and documents why this approach failed.

Erfan-Ahmadi · 2026-04-24T08:47:45Z

+
+static const uint32_t TILE_SIZE = 128u;
+
+[numthreads(128, 1, 1)]


try 2D workgroups.

make the workgroup size 128x4 (make sure it is reflected in your dispatch) --> 128*4=512 is a good workgroup size, it can fit exactly 3 workgroups on modern SMs

~~pixelPos will be just the global thread idx.~~ we need to have individual offsets for each tile request

your tile size is PoT, so you could just use a right shift by 7 for division (>> TILE_SIZE_LOG2). for modulo use &127u or &(TILE_SIZE-1)

tileIdx will be globalPos.xy >> 7 (define TILE_SIZE_LOG2)

localPos will be globalPos.xy&127

your start read location will be tileIdx*TILE_SIZE*TILE_SIZE or tileIdx << 14u

you get it :D I won't continue
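
A minimal CPU-side sketch of the shift-and-mask addressing suggested above, written in C++ so it can be checked outside the shader. The names `TileCoord`, `splitGlobalPos` and `tileStartElement` are hypothetical, and the layout (tiles stored back to back, addressed by a flattened tile index) is an assumption, not something taken from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Assumed constants mirroring the shader's TILE_SIZE = 128.
constexpr uint32_t TILE_SIZE_LOG2 = 7u;
constexpr uint32_t TILE_SIZE = 1u << TILE_SIZE_LOG2; // 128
constexpr uint32_t TILE_MASK = TILE_SIZE - 1u;       // 127

// Hypothetical helper type: which tile a pixel is in, and where inside it.
struct TileCoord {
    uint32_t tileX, tileY;   // tile index per axis (division by TILE_SIZE)
    uint32_t localX, localY; // position inside the tile (modulo TILE_SIZE)
};

// Power-of-two division and modulo done with a right shift and a mask.
inline TileCoord splitGlobalPos(uint32_t x, uint32_t y) {
    return TileCoord{x >> TILE_SIZE_LOG2, y >> TILE_SIZE_LOG2,
                     x & TILE_MASK, y & TILE_MASK};
}

// First element of a tile, assuming tiles are packed back to back and
// addressed by a flattened tile index (an assumption of this sketch).
inline uint32_t tileStartElement(uint32_t flatTileIdx) {
    // Guard against overflow of the 32-bit result.
    assert(flatTileIdx < (1u << (32u - 2u * TILE_SIZE_LOG2)));
    // Same value as flatTileIdx * TILE_SIZE * TILE_SIZE.
    return flatTileIdx << (2u * TILE_SIZE_LOG2);
}
```

For example, pixel (300, 5) lands in tile (2, 0) at local position (44, 5), and the tile with flattened index 3 starts at element 3 * 128 * 128.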

Erfan-Ahmadi · 2026-04-24T09:00:21Z

+
+[numthreads(128, 1, 1)]
+[shader("compute")]
+void MortonStore(uint32_t3 ID : SV_DispatchThreadID)


make the workgroup 2d 16x16 or a 512 1D workgroup

but it needs to handle 16x16 region of the tile

your read pos stays the same with flattened global idx (we need to make sure you're reading contiguously)

morton::code<false, 7, 2> mc; you now only need 4 bits for 16x16 so it becomes 4,2 I think

use the bitshift and & for division and modulo like in my previous comment.

it's very likely the compiler already does this optimization for you since TILE_SIZE is a macro, but it's good practice, in case it changed later to a push constant or something not known at compile time

make sure this change is reflected on your dispatch;

since each tile is 128x128 = 16384 pixels, it'll take 64 workgroups of 16x16 (256 threads), or 32 workgroups of size 512, to handle the copy of a single tile

use morton code locally within this 16x16 to figure out the write location + add offset.

doing morton on 16x16 tiles with added offset is no different than doing morton globally on a 128x128 tile (see image below)

Morton example for a 16x16 group:
thread 0: reads(0,0) at location 0 writes to pixelPos(0,0)
thread 1: reads(1,0) at location 1*ByteSize writes to pixelPos(1,0)
thread 2: reads(0,1) at location 2*ByteSize writes to pixelPos(0,1)
thread 3: reads(1,1) at location 3*ByteSize writes to pixelPos(1,1)
thread 4: reads(2,0) at location 4*ByteSize writes to pixelPos(2,0)
...
make sure this is what happens: reads are contiguous, writes are Morton ordered
might actually be easier to achieve this with a single 1D 512 workgroup, not sure

btw we're not going to go with morton benchmarking any further, but let's just fix it up and make it work as we first intended.
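
The thread-to-pixel mapping in the example above can be sketched on the CPU as follows. This is an illustration in C++ with a hand-written bit de-interleave, not the `morton::code` type used in the shader; `Pixel`, `compactEvenBits`, `mortonWritePos` and the `groupOrigin` parameter are hypothetical names introduced for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 2D pixel coordinate used only in this sketch.
struct Pixel {
    uint32_t x, y;
};

// Gather the bits at even positions of v into the low bits.
// This is the inverse of the 2D Morton bit interleave for one axis.
inline uint32_t compactEvenBits(uint32_t v) {
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return v;
}

// Thread localIdx (0..255) of a 16x16 group reads element localIdx
// contiguously and writes to the Morton-decoded pixel, offset by the
// group's top-left corner (groupOrigin is an assumed parameter).
inline Pixel mortonWritePos(uint32_t localIdx, Pixel groupOrigin) {
    assert(localIdx < 256u); // 4 bits per axis cover a 16x16 region
    return Pixel{groupOrigin.x + compactEvenBits(localIdx),
                 groupOrigin.y + compactEvenBits(localIdx >> 1)};
}
```

With a group origin of (0,0), threads 0 to 4 map to (0,0), (1,0), (0,1), (1,1) and (2,0), matching the list above, and thread 255 maps to (15,15).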

Erfan-Ahmadi · 2026-05-04T08:18:51Z

+	{
+		// Disabled after testing: this path needs CPU writes into host-visible
+		// OPTIMAL images, but the memory layout and preinitialized-image lifetime
+		// rules are too implementation-dependent to make this a clean benchmark.


add: "The devices we tested on didn't allow creating OPTIMAL images over host visible memory"

Erfan-Ahmadi · 2026-05-04T08:19:42Z

+		uint32_t tileSize,
+		uint32_t tileSizeBytes,
+		uint32_t workgroupSizeX,
+		uint32_t workgroupSizeY,


you could use uint32_t2
it's in nbl::hlsl namespace. makes your life easier with 2D/3D params
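
As a rough illustration of why 2D parameters are convenient here, the sketch below pairs tile size and workgroup size as 2-component values and derives the dispatch size from them. `UInt2` is a hypothetical stand-in defined locally so the snippet compiles on its own; it is not the Nabla type. `ceilDiv` and `workgroupCountForTile` are likewise assumed names, and the sketch assumes one thread per pixel.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 2-component unsigned vector type.
struct UInt2 {
    uint32_t x, y;
};

// Integer division rounded up, so partial workgroups are still dispatched.
inline uint32_t ceilDiv(uint32_t a, uint32_t b) {
    assert(b != 0u);
    return (a + b - 1u) / b;
}

// Workgroups needed per axis to cover one tile, assuming one thread per pixel.
inline UInt2 workgroupCountForTile(UInt2 tileSize, UInt2 workgroupSize) {
    return UInt2{ceilDiv(tileSize.x, workgroupSize.x),
                 ceilDiv(tileSize.y, workgroupSize.y)};
}
```

Under these assumptions a 128x128 tile needs 1x32 workgroups of 128x4 threads, or 8x8 workgroups of 16x16 threads.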

Erfan-Ahmadi · 2026-05-04T08:42:11Z

+
+[numthreads(128, 4, 1)]
+[shader("compute")]
+void SnakeLoad(uint32_t3 ID : SV_DispatchThreadID)


remove the unused load function. I don't think they are used anymore, right?

Erfan-Ahmadi · 2026-05-04T09:04:19Z

+
+[numthreads(16, 16, 1)]
+[shader("compute")]
+void MortonStore(uint32_t3 ID : SV_GroupThreadID, uint32_t3 GroupID : SV_GroupID)


confusing semantics: you're using ID for GroupThreadID here, but above in SnakeStore you're using the same name for the global dispatch ID

CrabExtra added 2 commits December 23, 2025 18:10

Add 73_ImageUploadBenchmark example

6635ba9

Simple benchmark HOST_VISIBLE vs HOST_VISIBLE & DEVICE_LOCAL

951e2fd

Erfan-Ahmadi reviewed Dec 24, 2025

Comment thread 73_ImageUploadBenchmark/main.cpp Outdated

Measurement was weird, added some detail and also fix a bug related to…

141295b

… FIF

Erfan-Ahmadi reviewed Dec 25, 2025

Comment thread 73_ImageUploadBenchmark/main.cpp Outdated

CrabExtra added 4 commits December 31, 2025 16:29

Resolved PR comments + adding timestamp query

874814a

Adding more logs to release build

ddb7bfc

Added image to image copy

f1fc8d5

compute shader added

7abe408

Erfan-Ahmadi reviewed Feb 26, 2026

Comment thread 73_ImageUploadBenchmark/app_resources/tile_upload.comp.hlsl Outdated

fixing PR comments

717f6ae

Erfan-Ahmadi reviewed Apr 24, 2026

Comment thread 73_ImageUploadBenchmark/main.cpp Outdated

Address image upload benchmark review comments

244802c

Erfan-Ahmadi reviewed May 4, 2026
